Fix/longhorn #1596

Despire · 2024-12-04T15:03:28Z

This PR replaces the current approach to volume replication, in which the number of replicas is increased and then decreased to force a replica onto another node while maintaining the replica count specified in the StorageClass,
with the Longhorn node drain policy block-for-eviction-if-contains-last-replica, which has the following benefits:

Protects data by preventing the drain operation from completing until there is a healthy replica available for each volume available on another node.
Automatically evicts replicas, so the user does not need to do it manually (through the UI).
The drain operation is only as slow and data-intensive as is necessary to protect data.

With this setting, Longhorn tries to maintain the replica count defined in the StorageClass (provided the required number of nodes exists). On node deletion, the setting blocks the deletion if the node being deleted is the last one holding a healthy replica, until a new replica is created on another node, after which the deletion continues as expected.
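
The blocking rule described above can be sketched in a few lines of Go. This is only an illustration of the decision, not Longhorn's or this PR's actual code: the `Replica` type and the `canFinishDrain` function are hypothetical names introduced here, and real replica state carries far more detail.

```go
package main

import "fmt"

// Replica is a hypothetical, simplified view of a volume replica.
type Replica struct {
	Node    string
	Healthy bool
}

// canFinishDrain reports whether draining the given node may complete for a
// volume: it may once at least one healthy replica lives on another node.
func canFinishDrain(node string, replicas []Replica) bool {
	for _, r := range replicas {
		if r.Healthy && r.Node != node {
			return true
		}
	}
	return false
}

func main() {
	// The only healthy replica is on the node being drained: drain is blocked.
	replicas := []Replica{{Node: "node-a", Healthy: true}}
	fmt.Println(canFinishDrain("node-a", replicas)) // false

	// After a replica has been rebuilt elsewhere, the drain may continue.
	replicas = append(replicas, Replica{Node: "node-b", Healthy: true})
	fmt.Println(canFinishDrain("node-a", replicas)) // true
}
```

An unhealthy replica on another node does not unblock the drain in this sketch, which matches the requirement of a healthy copy elsewhere.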

The changes were tested on dynamic and static nodes by deleting entire nodepools that held all of the replicas, forcing the replicas to move to another available nodepool. From time to time the eviction of replicas stalled for ~10-15 minutes, but it always resumed and completed.

Further, I had to disable concurrent cordoning of nodes, as I encountered an issue where node deletion would deadlock if all of the nodes holding the replicas were to be deleted. Nodes are now cordoned and deleted one by one instead.

Additionally, another bug was spotted where the StorageClasses for providers defined in the InputManifest were not cleaned up correctly. Longhorn annotations were also moved to the PatchAnnotations part of the Kuber microservice so that annotations are applied in a single place.

# Conflicts:
#	services/kuber/server/domain/utils/longhorn/longhorn.go
#	services/kuber/server/domain/utils/nodes/delete.go

JKBGIT1

LGTM 👍

Despire and others added 4 commits December 3, 2024 15:14

update longhorn pvc replication

271c9b5

Merge branch 'master' into fix/longhorn

36dda24

# Conflicts:
#	services/kuber/server/domain/utils/longhorn/longhorn.go
#	services/kuber/server/domain/utils/nodes/delete.go

finalize longhorn changes

0bfba24

Auto commit - update kustomization.yaml

a4308a3

Despire requested review from bernardhalas and JKBGIT1 December 5, 2024 07:15

JKBGIT1 approved these changes Dec 6, 2024

Despire added this pull request to the merge queue Dec 6, 2024

Merged via the queue into master with commit 956d889 Dec 6, 2024

Despire deleted the fix/longhorn branch December 6, 2024 14:48
